During the lab, Hariprasath focused on assignment 2 and Lakshidaa focused on assignment 1. After the lab, both of us collaborated to complete the remaining tasks and to clarify each other’s doubts. Then we discussed together and answered the questions related to analysis and interpretation of the plots.
ubs_data = read.delim("prices-and-earnings.txt", header = TRUE)
After reading the data, we have filtered out columns : 1,2,5,6,7,9,10,16,17,18,19 and using the first column containing city names as the row labels. The resulting data frame looks as follows:
ubs_data = ubs_data[, c(1,2,5,6,7,9,10,16,17,18,19)]
rownames(ubs_data) = ubs_data[[1]]
ubs_data[1] = NULL
head(ubs_data, 3)
## Food.Costs... iPhone.4S.hr. Clothing.Index Hours.Worked Wage.Net
## Amsterdam 364 44.5 110.8 1755 69.4
## Athens 390 86.0 112.5 1822 40.0
## Auckland 497 51.0 79.2 1852 63.5
## Vacation.Days Big.Mac.min. Bread.kg.in.min. Rice.kg.in.min.
## Amsterdam 24 16 7 9
## Athens 23 30 13 26
## Auckland 20 16 17 8
## Goods.and.Services...
## Amsterdam 3034
## Athens 2605
## Auckland 3019
We create a heatmap of the data without any reordering of rows/columns.
ubs_scaled = scale(ubs_data )
plot_ly(x = colnames(ubs_scaled), y = rownames(ubs_scaled),
z = ubs_scaled, type = "heatmap",
colors = colorRamp(c("yellow", "red"))) %>%
layout(title = "Heatmap of Column variables vs Cities")
In this heatmap, it is quite difficult to spot any clusters as the colors in the rows and columns of data are varying a lot and very randomly mixed. It is also not possible to spot any outliers in this heatmap as it is difficult to identify the range of colors for each row/column. So, it is not possible to see any clusters and outliers.
After computing orders that optimize Hamiltonian Path Length and using Hiearchical Clustering (HC) as the optimization algorithm, we plotted two heatmaps based on the two types of distance measures.
Firstly, we compute orders for rows and columns using Hierarchical Clustering that optimizes Hamiltonian path length computed using Euclidean distance. Then, we create a heatmap with the reordered data.
plot_ly(x = colnames(ubs_reordered_euc),
y = rownames(ubs_reordered_euc),
z = ubs_reordered_euc, type = "heatmap",
colors = colorRamp(c("yellow", "red"))) %>%
layout(title = "Heatmap of Column variables vs Cities")
Secondly, we compute orders for rows and columns using Hierarchical Clustering that optimizes Hamiltonian path length computed using one-minus-correlation distance measure. Then, we create a heatmap with the reordered data.
plot_ly(x = colnames(ubs_reordered_cor),
y = rownames(ubs_reordered_cor),
z = ubs_reordered_cor, type = "heatmap",
colors = colorRamp(c("yellow", "red"))) %>%
layout(title = "Heatmap of Column variables vs Cities")
From both these heatmaps, it can be observed that there are broadly 2 clusters of cities. But within these clusters, the ordering based on Euclidean distance seems to have more homogeneous values compared to the one based on one-minus-correlation. For example, in the one-minus-correlation based heatmap, the cities Dubai, Hong Kong, Taipei, etc (at the left-bottom) seem to be dissimilar from nearby observations. So, the heatmap obtained by computing the Euclidean distance appears to produce more homogeneous clusters and is much easier to analyse than the other one.
From the heatmap based on Euclidean distance, it can be seen that (iPhone.4S.hr and Big.Mac.min) and (Wage.Net and Goods.and.Services) seem to be correlated. The columns Hours.Worked, Bread.kg.in.min, Rice.kg.in.min, iPhone.4S.hr and Big.Mac.min seem to be somewhat negatively correlated to the columns Clothing.Index, Food.Costs, Wage.Net and Goods.and.Services. The outlier in this heatmap could be the observations for the cities Beijing, Shangai, Bangkok and Mexico City based on the very low values for Vacation.Days. The cities Hong Kong, Seoul and Caracas show very high Food.Costs which do not follow the trend observed in other cities and hence, could be outliers.
Here, we create a heatmap for a permutation of rows and columns that optimizes Hamiltonian Path Length using Traveling Salesman Problem (TSP) as solver.
plot_ly(x = colnames(ubs_reordered_tsp),
y = rownames(ubs_reordered_tsp),
z = ubs_reordered_tsp, type="heatmap",
colors = colorRamp(c("yellow", "red"))) %>%
layout(title = "Heatmap of Column variables vs Cities
(reordered by TSP/Hamiltonian Path Length/Euclidean Distance)")
On comparing the heatmap given by the reordering using the TSP solver to the one generated by the HC solver, it can be seen that the one obtained using TSP is better. In the HC reordering, it can be seen that there are broadly two clusters of cities. But within these clusters, we can observe some variation. Whereas in the TSP reodering, there are more fine-grained clusters which are smaller but have more similar appearance.
We compute the values of Hamiltonian path length and Gradient measure achieved by the 2 approaches - HC and TSP for row permutations.
## [1] "Hamiltonian Path Length : 122.173895596007"
## [1] "Gradient Measure : 40636"
## [1] "Hamiltonian Path Length : 144.492476381981"
## [1] "Gradient Measure : 58432"
On comparing the Hamiltonian Path Length and the Gradient Measure between the row permutations of TSP and HC solvers, it can be observed that the Hamiltonian Path Length as well as the Gradient Measure are better optimized by the TSP solver. This shows that the TSP solver performs better in terms of optimization criteria also.
Firstly, we make parallel coordinate plots from unordered data.
plot_ly(data = ubs_data, type = 'parcoords',
dimensions = unord_dim_list) %>%
layout(title = "")
After manually adjusting the order of variables, we have arrived at the following order of variables for which the clusters appear more prominent. We have also approximately brushed the different clusters that can be observed with different colors (based on value of Wage.Net). Based on further analysis of the variables, these clusters (especially blue cluster) can be further refined by taking values of other variables as well into consideration.
plot_ly(data = ubs_data, type = 'parcoords',
line = list(color = ~cluster, colorscale = list(c(0,"red"), c(1,"blue"))),
dimensions = reord_dim_list) %>%
layout(title = "")
The cluster marked in red color seems to be very concentrated in certain areas for some variables and seems like a good cluster. In comparison, the cluster marked in blue color shows more variance and does not show a strong trend in many of the variables. The brushing is done based on the value of Wage.Net where blue cluster contains observations with value less than 24 and the rest are in red cluster.
Properties of cluster 1 (Red):
Wage.Net variable greater than 24. Hence, it contains the higher values of Wage.Net.Goods.and.Services (> 2000), Food.Costs (> 300), Vacation.Days (> 17) and lower values of iPhone.4S.hr (< 150), Big.Mac.min (< 175), Bread.kg.in.min (< 25).Food.Costs (270-590), iPhone.4S.hr (0-150), Big.Mac.min (0-40), Bread.kg.in.min (0-30), Rice.kg.in.min (0-21) seem to have the observations more concentrated for specific values and could be important in defining the cluster.Properties of cluster 2 (Blue):
Wage.Net variable less than 24. Hence, it contains the lower values of Wage.Net.iPhone.4S.hr (>120), Big.Mac.min (>30), Rice.kg.in.min (>17) and lower values of Goods.and.Services (<2200), Food.Costs (<400).Wage.Net (0-24), Goods.and.Services (0-2400), Food.Costs (0-480) seem to have the observations more concentrated for specific values and could be important in defining the cluster.Interpretation of clusters: The blue cluster appears to contain the cities having lower wages and hence, higher amount of work time is required to afford many of the items. On the other hand, the red cluster contains the cities which have higher wages and hence, much lower amount of time is required to afford many of the items.
Most prominent outlier: The most prominent outlier appears to be Caracas which belongs to the blue cluster. This city appears to have an unusually high value of Goods.and.Services compared to other observations which had similar values of Wage.Net. Many of the connecting lines for Caracas are very unique and has hardly any other cities showing a similar trend. Hence, it can be considered as an outlier.
Interpretation of outlier: Caracas has lower wages and the costs of food (except rice) as well as other goods like iPhone seem to be comparatively higher. So, the work time required to afford these items are much higher than other cities.
We create a radar chart diagram with juxtaposed radars created by using the data obtained using the HC solver. We have hidden the axis labels and tick labels to make it less cluttered and easier to compare the shapes.
Based on the similarity of radar diagrams, one cluster that can be identified from this radar chart diagram consists of the cities Barcelona, Madrid, Berlin, Helsinki, Frankfurt, Vienna, Munich and Lyon. Another cluster that can be seen consists of the cities Auckland, London and Dublin. These groups of cities have shapes which are very similar and could belong to the same cluster.
The most distinct outlier that can be observed from this radar chart diagram is Caracas since it has a radar diagram which appears very different compared to other radar diagrams which are in the nearby rows.
We think that the heatmap obtained using the TSP solver optimizing Hamiltonian Path Length computed using Euclidean distances was the best in analyzing this data. We were able to see the clusters more clearly since they were represented by colors. This improved the simplicity of observing the patterns compared to observing shapes or line orientations. The heatmap showed more fine-grained clusters as compared to the other methods. Similar clusters could be identified from the radar chart as well but that would be more time consuming as we would have to carefully go through many shapes and examine their similarity. Whereas, it is very efficient to interpret the heatmap which is concise in representing the data.
# Reading dataset from csv file
adult_data = read.csv("adult.csv", header = FALSE)
colnames(adult_data) = c("age", "workclass", "fnlwgt", "education", "education_num",
"marital_status", "occupation", "relationship", "race",
"sex", "capital_gain", "capital_loss", "hours_per_week",
"native_country", "income_level")
We make a scatter plot showing Hours per Week versus Age where observations are colored by Income level.
ggplot(adult_data) +
geom_point(aes(x = age, y = hours_per_week, color = income_level)) +
ggtitle("Relationship between Age, Hours per week and Income level")
It is problematic to analyze this plot because there is a lot of overplotting and occlusion. Often, it looks like the lower income level is plotted over the higher income level (we see much lesser blue dots compared to the next trellis plot). Because of this, we are not able to get a clear idea about the distribution of the 2 income levels.
Next, we make a trellis plot of Hours per Week versus Age conditioned on Income level.
ggplot(adult_data) +
geom_point(aes(x = age, y = hours_per_week, color = income_level)) +
facet_grid(income_level~.) +
ggtitle("Relationship between Age, Hours per week for each Income level")
From the trellis plot, we are able to notice the ditribution of the 2 income levels more clearly in the scatter plots.
Conclusions:
Here, we create a density plot of Age grouped by Income level.
ggplot(adult_data) +
geom_density(aes(x = age, group = income_level, fill = income_level), alpha = 0.5) +
ggtitle("Density plot of age for each income level")
Next, we create a trellis plot of density plots of Age conditioned on Marital Status.
ggplot(adult_data) +
geom_density(aes(x = age, group = income_level, fill = income_level), alpha = 0.5) +
facet_grid(marital_status~.) +
ggtitle("Density plot of age for each income level") +
theme(strip.text.y = element_text(angle = 0))
From the density plots of the 2 income levels, we can observe the differences in their age distributions. we see that the ages of the “>50K” income level almost resembles a normal distribution with mild skewness on the right. The “<=50K” on the other hand, is heavily skewed to the right. Among the 2, “>50K” income level has a higher peak and it’s mode occurs around age = 45. The “<=50K” income level has it’s mode around 25. We can conclude that in the lower income level, majority are younger people (age < 30) and in the higher income level, majority are middle aged people (ages 35-50).
From the trellis plot, we can observe how the distributions vary for each marital status.
Analysis and conclusions:
After filtering out the observations with capital loss equal to zero, we create a 3D scatter plot of Education num vs Age vs Capital loss.
plot_ly(capital_loss_data) %>%
add_markers(x = ~education_num, y = ~age,
z = ~capital_loss, marker = list(size = 3) ) %>%
layout(title = "Dependence of Capital loss vs Education num vs Age",
scene = list(xaxis = list(title = 'Education Num'),
yaxis = list(title = 'Age'),
zaxis = list(title = 'Capital Loss')))
Even if we adjust the size of the dots and change the viewing angle, the dots are very crowded and appear a bit messy. So, it is difficult to spot any existing trends. And while varying viewing angles, it is difficult to keep track of the orientation of the axes and make conclusions.
Next, we make a trellis plot in which each panel shows a raster-type-2d-density plot of capital loss vs Education num conditioned on 6 age intervals.
ggplot(capital_loss_data) +
stat_density_2d(geom = "raster",
aes(x = education_num, y = capital_loss,
fill = stat(density)), contour = FALSE) +
scale_fill_distiller(palette = "Spectral") +
facet_grid(cols = vars(cut_number(age, 6)))
Analysis:
Firstly, we make a trellis plot where each panel shows a scatter plot of Capital loss vs Education num conditioned on intervals of age obtained using cut_number.
ggplot(capital_loss_data) +
geom_point(aes(x = education_num, y = capital_loss)) +
facet_grid(vars(cut_number(age, 4))) +
ggtitle("Capital loss vs Education num for each age interval") +
theme(plot.title = element_text(hjust = 0.5))
Second, we make a trellis plot where each panel shows a scatter plot of Capital loss vs Education num conditioned on intervals of age obtained using Shingles with 10% overlap.
ggplot(shingles_df) +
geom_point(aes(x = education_num, y = capital_loss)) +
facet_grid(vars(age_interval)) +
ggtitle("Capital loss vs Education num for each age interval (with shingles)") +
theme(plot.title = element_text(hjust = 0.5))
Comparing the plots produced using cut_interval and shingles, we do not observe much visible difference.
Advantages: By using shingles, we can avoid the boundary effects which might occur at the boundary of the normal interval points. We might observe some trends within a panel and by using shingles, we can check if the trend is continued in the panels which are before and after as well. This kind of analysis would be useful for continuous variables.
Disadvantages Sometimes, the boundary effects could skew the local effects in the interval. Becuase, of this we might find it difficult to identify the local effects or make wrong conclusions about intervals.
knitr::opts_chunk$set(echo = TRUE)
# Loading required R packages
library(ggplot2)
library(plotly)
library(dplyr)
library(scales)
library(seriation)
#------------------------------------------------------------------------------
# Assignment 1: High-dimensional visualization of economic data
ubs_data = read.delim("prices-and-earnings.txt", header = TRUE)
# Task 1.1
ubs_data = ubs_data[, c(1,2,5,6,7,9,10,16,17,18,19)]
rownames(ubs_data) = ubs_data[[1]]
ubs_data[1] = NULL
head(ubs_data, 3)
#------------------------------------------------------------------------------
# Task 1.2
ubs_scaled = scale(ubs_data )
plot_ly(x = colnames(ubs_scaled), y = rownames(ubs_scaled),
z = ubs_scaled, type = "heatmap",
colors = colorRamp(c("yellow", "red"))) %>%
layout(title = "Heatmap of Column variables vs Cities")
#------------------------------------------------------------------------------
# Task 1.3.a
row_dist_euc = dist(ubs_scaled)
col_dist_euc = dist(t(ubs_scaled))
seriate_row_euc = seriate(row_dist_euc, "HC")
seriate_col_euc = seriate(col_dist_euc, "HC")
ord_row_euc = get_order(seriate_row_euc)
ord_col_euc = get_order(seriate_col_euc)
ubs_reordered_euc = ubs_scaled[rev(ord_row_euc),ord_col_euc]
plot_ly(x = colnames(ubs_reordered_euc),
y = rownames(ubs_reordered_euc),
z = ubs_reordered_euc, type = "heatmap",
colors = colorRamp(c("yellow", "red"))) %>%
layout(title = "Heatmap of Column variables vs Cities")
# Task 1.3.b
# row_dist_cor = as.dist(1 - abs(cor(t(ubs_scaled)))
# col_dist_cor = as.dist(1 - abs(cor(ubs_scaled)))
row_dist_cor = as.dist(1 - cor(t(ubs_scaled)))
col_dist_cor = as.dist(1 - cor(ubs_scaled))
seriate_row_cor = seriate(row_dist_cor, method = "HC")
seriate_col_cor = seriate(col_dist_cor, method = "HC")
ord_row_cor = get_order(seriate_row_cor)
ord_col_cor = get_order(seriate_col_cor)
ubs_reordered_cor = ubs_scaled[rev(ord_row_cor), ord_col_cor]
plot_ly(x = colnames(ubs_reordered_cor),
y = rownames(ubs_reordered_cor),
z = ubs_reordered_cor, type = "heatmap",
colors = colorRamp(c("yellow", "red"))) %>%
layout(title = "Heatmap of Column variables vs Cities")
#------------------------------------------------------------------------------
# Task 1.4
seriate_row_tsp = seriate(row_dist_euc, "TSP")
seriate_col_tsp = seriate(col_dist_euc, "TSP")
ord_row_tsp = get_order(seriate_row_tsp)
ord_col_tsp = get_order(seriate_col_tsp)
ubs_reordered_tsp = ubs_scaled[rev(ord_row_tsp), ord_col_tsp]
plot_ly(x = colnames(ubs_reordered_tsp),
y = rownames(ubs_reordered_tsp),
z = ubs_reordered_tsp, type="heatmap",
colors = colorRamp(c("yellow", "red"))) %>%
layout(title = "Heatmap of Column variables vs Cities
(reordered by TSP/Hamiltonian Path Length/Euclidean Distance)")
#Hamiltonian Path Length
ord_tsp = seriate(row_dist_euc, "TSP")
ham_tsp = criterion(row_dist_euc, order = ord_row_tsp, "Path_length")
paste("Hamiltonian Path Length : ", ham_tsp)
#Gradient Measure
gm_tsp = criterion(row_dist_euc, order=ord_tsp, "Gradient_raw")
paste("Gradient Measure : ", gm_tsp)
#Hamiltonian Path Length
ord_hc = seriate(row_dist_euc, "HC")
ham_hc = criterion(row_dist_euc, order = ord_hc, "Path_length")
paste("Hamiltonian Path Length : ", ham_hc)
#Gradient Measure
gm_hc = criterion(row_dist_euc, order = ord_hc, "Gradient_raw")
paste("Gradient Measure : ", gm_hc)
#------------------------------------------------------------------------------
# Task 1.5
# Function to create dimension list based on given variable order
get_dimension_list <- function(df, col_order){
dim_list = list()
i = 1
for (col in col_order){
dim_list[[i]] = list(label = col, values = df[[col]])
i = i + 1
}
return(dim_list)
}
# Create dimension list for unordered data
unord_dim_list = get_dimension_list(ubs_data, colnames(ubs_data))
# Create dimension list for reordered data
var_order = c("Cloting.Index", "Wage.Net", "Goods.and.Services...", "Food.Costs...",
"Vacation.Days", "iPhone.4S.hr.", "Big.Mac.min.",
"Bread.kg.in.min.", "Rice.kg.in.min.", "Hours.Worked")
reord_dim_list = get_dimension_list(ubs_data, var_order)
# Brush different clusters with different colors
ubs_data$cluster = 1
ubs_data[ubs_data$Wage.Net < 24 , "cluster"] = 2
plot_ly(data = ubs_data, type = 'parcoords',
dimensions = unord_dim_list) %>%
layout(title = "")
plot_ly(data = ubs_data, type = 'parcoords',
line = list(color = ~cluster, colorscale = list(c(0,"red"), c(1,"blue"))),
dimensions = reord_dim_list) %>%
layout(title = "")
#------------------------------------------------------------------------------
# Task 1.6
Ps = list()
nPlot = nrow(ubs_reordered_euc)
as.data.frame(ubs_reordered_euc) %>%
add_rownames(var = "group") %>%
mutate_each(funs(rescale), -group) -> ubs_radar
for (i in 1:nPlot){
Ps[[i]] = htmltools::tags$div(
plot_ly(type = 'scatterpolar',
r = as.numeric(ubs_radar[i,-1]),
theta = colnames(ubs_radar)[-1],
fill = "toself")%>%
layout(title = ubs_radar$group[i],
polar = list(radialaxis = list(title="", showticklabels = FALSE),
angularaxis = list(title="", showticklabels = FALSE))),
style = "width: 20%;")
}
h = htmltools::tags$div(style = "display: flex; flex-wrap: wrap", Ps)
htmltools::browsable(h)
#------------------------------------------------------------------------------
# Assignment 2: Trellis plots for population analysis
# Reading dataset from csv file
adult_data = read.csv("adult.csv", header = FALSE)
colnames(adult_data) = c("age", "workclass", "fnlwgt", "education", "education_num",
"marital_status", "occupation", "relationship", "race",
"sex", "capital_gain", "capital_loss", "hours_per_week",
"native_country", "income_level")
#------------------------------------------------------------------------------
# Task 2.1
ggplot(adult_data) +
geom_point(aes(x = age, y = hours_per_week, color = income_level)) +
ggtitle("Relationship between Age, Hours per week and Income level")
ggplot(adult_data) +
geom_point(aes(x = age, y = hours_per_week, color = income_level)) +
facet_grid(income_level~.) +
ggtitle("Relationship between Age, Hours per week for each Income level")
#------------------------------------------------------------------------------
# Task 2.2
ggplot(adult_data) +
geom_density(aes(x = age, group = income_level, fill = income_level), alpha = 0.5) +
ggtitle("Density plot of age for each income level")
ggplot(adult_data) +
geom_density(aes(x = age, group = income_level, fill = income_level), alpha = 0.5) +
facet_grid(marital_status~.) +
ggtitle("Density plot of age for each income level") +
theme(strip.text.y = element_text(angle = 0))
#------------------------------------------------------------------------------
# Task 2.3
capital_loss_data = adult_data[adult_data$capital_loss != 0,]
plot_ly(capital_loss_data) %>%
add_markers(x = ~education_num, y = ~age,
z = ~capital_loss, marker = list(size = 3) ) %>%
layout(title = "Dependence of Capital loss vs Education num vs Age",
scene = list(xaxis = list(title = 'Education Num'),
yaxis = list(title = 'Age'),
zaxis = list(title = 'Capital Loss')))
ggplot(capital_loss_data) +
stat_density_2d(geom = "raster",
aes(x = education_num, y = capital_loss,
fill = stat(density)), contour = FALSE) +
scale_fill_distiller(palette = "Spectral") +
facet_grid(cols = vars(cut_number(age, 6)))
#------------------------------------------------------------------------------
# Task 2.4
# Task 2.4.a
ggplot(capital_loss_data) +
geom_point(aes(x = education_num, y = capital_loss)) +
facet_grid(vars(cut_number(age, 4))) +
ggtitle("Capital loss vs Education num for each age interval") +
theme(plot.title = element_text(hjust = 0.5))
# Task 2.4.b
age_ranges = lattice::equal.count(capital_loss_data$age, number = 4, overlap = 0.1)
age_range_matrix = matrix(unlist(levels(age_ranges)), ncol=2, byrow = T)
age_range_df = data.frame(Lower = age_range_matrix[,1],
Upper = age_range_matrix[,2],
Interval = factor(1:nrow(age_range_matrix)))
index = c()
age_interval = c()
for(i in 1:nrow(age_range_df)){
interval_name = paste("[", age_range_df$Lower[i], ",",
age_range_df$Upper[i], "]", sep="")
indices_in_interval = which(capital_loss_data$age >= age_range_df$Lower[i] &
capital_loss_data$age <= age_range_df$Upper[i])
index = c(index, indices_in_interval)
age_interval = c(age_interval, rep(interval_name, length(indices_in_interval)))
}
shingles_df = capital_loss_data[index,]
shingles_df = cbind(shingles_df, age_interval)
ggplot(shingles_df) +
geom_point(aes(x = education_num, y = capital_loss)) +
facet_grid(vars(age_interval)) +
ggtitle("Capital loss vs Education num for each age interval (with shingles)") +
theme(plot.title = element_text(hjust = 0.5))